29 research outputs found

    Reliability for exascale computing : system modelling and error mitigation for task-parallel HPC applications

    As high performance computing (HPC) systems continue to grow, their fault rate increases. Applications running on these systems have to deal with failures at intervals on the order of hours or days. Furthermore, some studies of future Exascale systems predict intervals on the order of minutes. As a result, efficient fault tolerance solutions are needed to tolerate frequent failures. A fault tolerance solution for future HPC and Exascale systems must be low-cost, efficient and highly scalable. It should have low overhead in fault-free execution and provide fast restart, because long-running applications are expected to experience many faults during execution. Meanwhile, task-based dataflow parallel programming models (PM) are becoming a popular paradigm in HPC applications at large scale. For instance, we see the adoption of task-based dataflow parallelism in OpenMP 4.0, the OmpSs PM, Argobots and Intel Threading Building Blocks. In this thesis we propose fault-tolerance solutions for task-parallel dataflow HPC applications. Specifically, we first design and implement a checkpoint/restart and message-logging framework to recover from errors. We then develop performance models to investigate the benefits of our task-level frameworks when integrated with system-wide checkpointing. Moreover, we design and implement selective task replication mechanisms to detect and recover from silent data corruptions in task-parallel dataflow HPC applications. Finally, we introduce a runtime-based coding scheme to detect and recover from memory errors in these applications. Taken together, our schemes provide fairly high failure coverage in which both computation and memory are protected against errors.
    Postprint (published version)

    A runtime heuristic to selectively replicate tasks for application-specific reliability targets

    In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification or recompilation of the OS, compiler or application code. Our heuristic, which we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App_FIT selective replication heuristic is low-overhead and highly scalable. In addition, results indicate that complete task replication is overkill for achieving reliability targets. We show that with App_FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.
    This work was supported by FI-DGR 2013 scholarship and the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and in part by the European Union (FEDER funds) under contract TIN2015-65316-P.
    Peer Reviewed. Postprint (author's final draft)

    Designing and modelling selective replication for fault-tolerant HPC applications

    Fail-stop errors and Silent Data Corruptions (SDCs) are the most common failure modes for High Performance Computing (HPC) applications. There are studies that address fail-stop errors and studies that address SDCs. However, few studies address both types of errors together. In this paper we propose a software-based selective replication technique for HPC applications for both fail-stop errors and SDCs. Since complete replication of applications can be costly in terms of resources, we develop a runtime-based technique for selective replication. Selective replication provides an opportunity to meet HPC reliability targets while decreasing resource costs. Our technique is low-overhead, automatic and completely transparent to the user.
    This work is supported in part by the European Union Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and the FEDER funds under contract TIN2015-65316-P.
    Peer Reviewed. Postprint (author's final draft)

    Spatial support vector regression to detect silent errors in the exascale era

    As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs), or silent errors, are one of the major sources that corrupt the execution results of HPC applications without being detected. In this work, we explore a low-memory-overhead SDC detector, by leveraging epsilon-insensitive support vector machine regression, to detect SDCs that occur in HPC applications that can be characterized by an impact error bound. The key contributions are threefold. (1) Our design takes spatial features (i.e., neighbouring data values for each data point in a snapshot) into the training data, such that little memory overhead (less than 1%) is introduced. (2) We provide an in-depth study on the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show that our detector can achieve detection sensitivity (i.e., recall) of up to 99% with a false positive rate below 1% in most cases. Our detector incurs low performance overhead, 5% on average, for all benchmarks studied in the paper. Compared with other state-of-the-art techniques, our detector exhibits the best tradeoff between detection ability and overheads.
    This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research Program, under Contract DE-AC02-06CH11357, by FI-DGR 2013 scholarship, by HiPEAC PhD Collaboration Grant, the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402, and TIN2015-65316-P.
    Peer Reviewed. Postprint (author's final draft)

    Unified fault-tolerance framework for hybrid task-parallel message-passing applications

    We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Secondly, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.
    The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the FI-DGR 2013 scholarship and the European Community’s Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2 Project (www.montblanc-project.eu), grant agreement no. 610402 and TIN2015-65316-P.
    Peer Reviewed. Postprint (author's final draft)

    Exploring the capabilities of support vector machines in detecting silent data corruptions

    As the exascale era approaches, the increasing capacity of high-performance computing (HPC) systems with targeted power and energy budget goals introduces significant challenges in reliability. Silent data corruptions (SDCs), or silent errors, are one of the major sources that corrupt the execution results of HPC applications without being detected. In this work, we explore a set of novel SDC detectors, by leveraging epsilon-insensitive support vector machine regression, to detect SDCs that occur in HPC applications. The key contributions are threefold. (1) Our exploration takes temporal, spatial, and spatiotemporal features into account and analyzes different detectors based on different features. (2) We provide an in-depth study on the detection ability and performance with different parameters, and we optimize the detection range carefully. (3) Experiments with eight real-world HPC applications show that support-vector-machine-based detectors can achieve detection sensitivity (i.e., recall) of up to 99% with a false positive rate below 1% in most cases. Our detectors incur low performance overhead, 5% on average, for all benchmarks studied in this work.
    This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Award Number 66905, program manager Lucy Nowell. Pacific Northwest National Laboratory is operated by Battelle for DOE under Contract DE-AC05-76RL01830. In addition, this material is based upon work supported by the National Science Foundation under Grant No. 1619253, and also by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, program manager Lucy Nowell, under contract number DE-AC02-06CH11357 (DOE Catalog project) and in part by the European Union FEDER funds under contract TIN2015-65316-P.
    Peer Reviewed. Postprint (author's final draft)

    Treatment of vascular injuries associated with limb fractures


    Social Transformation in Safavid Administration

    The Safavid Empire, which remained on the stage of history from the early 16th century to the middle of the 18th century, is a significant historical actor whose effects are still felt today across a very wide geography, from the Balkans to Central Asia and from the Caucasus to the Strait of Hormuz. For this reason, there is notable academic interest in the Safavids in Iran, Turkey, Europe, Japan and the USA. A political structure this powerful, with a religious order at its core, is neither a phenomenon encountered often in history nor one that can be ignored. In fact, the real difficulty of Safavid studies stems from this mystical and captivating core and its historical reflections. The predecessor-successor relationship with the Safavids that holds in historical Azerbaijan and present-day Iran; for Turkey, the long-standing sense of otherness arising from the Ottoman heritage and sectarian structure; and finally, in non-Muslim countries, particularly in the West, the Safavids' role as the political representative of the heterodox face of Islam can shift Safavid studies onto multi-layered ground that is open to ideological prejudice. As a result, studies on the Safavids seem confined to a fixed course, either on the plane of chronological political history or under the direction of these ideological viewpoints. In our study, to avoid repeating earlier work, we limited the chronological dimension of Safavid history to its broad outlines. At the centre of the study we placed the interactions of the Safavid Empire with the Turkish states that preceded it, such as the Qara-Qoyunlus and Aq-Qoyunlus, paying regard to cause and effect in terms of administrative organisation, army structure, and ethnic, cultural and economic structures. In this way, the emergence of the Safavids is framed as a claim of continuity, and the reflections of the military, administrative, ethnic and economic features of the Qara-Qoyunlus and Aq-Qoyunlus on the Safavids are grounded in concrete data.
    We addressed the position of the Safavid Order within the state, which is the essential difference between these two Turkmen states and the Safavids, in its historical development both before and after state formation. We designated the history of the Safavid Empire from Shah Ismail I to the first period of Shah Tahmasb I as the "First Period" and set out the administrative, ethnic and religious reasons for this designation. We explained the change, beginning with Shah Tahmasb I, in the state founded by the Qizilbash Turkmen who were organised around the Ardabil Lodge, and we described how this change penetrated both the state and the belief system. This process affected not only the demographic structure of the region but also its map of beliefs, from the west of the Caspian Sea to the interior of Anatolia and from the foothills of the Caucasus to the gates of the Arabian Peninsula. This reality is vital for illuminating the political, ethnic and religious history of the region. This study of the Safavids will contribute to a better understanding of Iran first and foremost, along with the Ottoman Empire and the Republic of Turkey; of all the relevant ethnic groups, the Turkmen in particular; and, most of all, of Qizilbashism and Shi'ism.